Hand position recognition is important for human-computer interaction. Different devices and technologies can be used for data acquisition, each with its own specifications and accuracy; one of these devices is the Kinect V2 sensor. The three-dimensional locations of the skeleton joints are taken from the Kinect device to create three types of data: the first is raw joint position data, the second is the angles between joints, and the third combines both. These three types of data are used to train four classifiers: support vector machines, random forest, k-nearest neighbors, and multilayer perceptron. The experiments are carried out on a dataset of 30,480 frames from 127 volunteers, and the saved trained models are used to predict and classify eight hand positions in a real-time system. The results show that the proposed approach performs well, with accuracy reaching 99.07% in some cases and a very short average time for checking frames sequentially, which in some cases reaches 0.59 × 10^-3 seconds per frame. This system can be used in many applications, such as controlling robots or devices, comparing physical exercises, or monitoring elderly people and patients.
1. INTRODUCTION

The Microsoft Kinect V2 sensor is used in many scientific fields because it is cheap, accurate [1], [2], fast, and easy to set up. For skeleton data, Kinect provides the locations of 25 virtual anatomical joints, which are extracted from the depth map with a per-pixel semantic segmentation algorithm [3]. It can track up to six people, and it comes with a powerful software development kit (SDK). This technology has allowed many applications to be developed beyond the original scope of gaming. These include detecting the human body or a part of it, such as the face, hands, or legs, and recognizing movements and gestures in fields such as sign language and gait recognition [4]-[9]. It is also used to monitor patients and the elderly for healthcare or fall detection and to alert those concerned, using one or several devices [10]-[12]. Other applications monitor exercises, using an avatar to teach and display movements and to check that they are performed correctly [10], [13], or control a whole robot or a robot arm through gestures or imitation of movements [6], [14]. The sensor can be used in real-time applications [15] and as a scanner for 3D printing [16]. Artificial intelligence plays a large role in these areas.
We apply multiple classification algorithms to three types of data extracted from the second version of the Kinect in order to study and compare the effectiveness and accuracy of each classification method, and we use the best classifiers in an online test model.

Journal homepage: http://ijeecs.iaescore.com

Indonesian J Elec Eng & Comp Sci, ISSN: 2502-4752

The Kinect V2 sensor is a device developed by Microsoft. It was initially launched with the Xbox game console, and a version for Windows was released later (Figure 1). The Kinect has two cameras with different resolutions: an RGB color camera with a resolution of 1920x1080 pixels and a depth camera with a resolution of 512x424 pixels. At any given moment, Kinect can track up to six skeletons, each with 25 joints, as shown in Figure 2(a). The joints are numbered from 0 to 24, and each joint has 11 attributes: color (x, y), depth (x, y), camera coordinates (x, y, z), and orientation (x, y, z, w), as shown in Figure 2(b). Figure 3 represents the output data of the Kinect V2 and summarizes the point cloud computation.
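The 11 attributes per joint can be pictured as a small record. The following is only an illustrative sketch: the class name, field names, and sample values are hypothetical and are not taken from the Kinect SDK.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KinectJoint:
    """One tracked joint with 11 attributes (field names are illustrative)."""
    color_xy: Tuple[float, float]                         # position in the color image
    depth_xy: Tuple[float, float]                         # position in the depth image
    camera_xyz: Tuple[float, float, float]                # 3D camera coordinates
    orientation_xyzw: Tuple[float, float, float, float]   # orientation (x, y, z, w)

    def as_features(self) -> List[float]:
        """Flatten the joint into a list of 11 numbers."""
        return [*self.color_xy, *self.depth_xy,
                *self.camera_xyz, *self.orientation_xyzw]


# Made-up values for a single joint, only to show the layout.
head = KinectJoint((960.0, 300.0), (256.0, 120.0),
                   (0.02, 0.65, 2.10), (0.0, 0.0, 0.0, 1.0))
features = head.as_features()
print(len(features))
```

Only the three camera coordinates of each joint are used as raw features in the rest of this paper.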

The Kinect's camera coordinates use the infrared sensor to find the 3D locations of the joints in space. These are the coordinates to use for joint placement in 3D projects. It is worth remembering that the Kinect skeleton returns "joints" rather than "bones" [17]. What matters to us is the raw data, that is, the three-dimensional locations of the skeletal joints: we use them directly as the first type of data, and we also use them to calculate the angles, which form the second type of data.


Figure 1. The face of the Kinect V2 sensor, showing the placement of the cameras and emitters (RGB camera, IR camera, IR emitter) [18]

Figure 2. Joint data of the Kinect V2 sensor: (a) joint map of a human skeleton, and (b) an example of one Kinect joint's 11 features [19]

Figure 3. Schematic representation of the output data of Kinect v2 and summary of point cloud computation [20]

Different classifiers are used in this research to classify the types of hand positions. We chose to detect and classify eight positions: “hands up”, “right hand up”, “left hand up”, “hands on head”, “arms open”, “stand up straight”, “hands on waist”, and “hands forward”. The following classifiers are applied: support vector machines (SVMs) [21], [22], k-nearest neighbors (kNN) [23], random forests (RF) [24], and multilayer perceptron (MLP) [25]. The goals of this research are:

— Find the most accurate classifier and use it in a system that distinguishes movements, which can be applied in simulators and robot control.

— Discover which kind of data derived from the Kinect skeleton gives the best results with the classifiers in terms of speed and accuracy.

— Provide an efficient method of storing and retrieving trained models to reduce the training time of the system.

— Design and implement a fast system that uses the classifiers for real-time recognition.


2. RELATED WORKS 

Many researchers have worked in this field; we present some recent studies related to the classifiers used in this paper. Adama et al. [26] offered an activity recognition learning system for assistive robots that uses an SVM classifier to learn everyday activities from 3D skeletal data. Byun and Lee [27] presented a survey of the use of SVMs in various applications. SVMs were applied successfully to several problems, including voice recognition with speaker identification, face recognition with identification of the person, handwriting recognition, and digit recognition, and most results showed that RBF kernels were usually better than linear or polynomial kernels.

Manzi et al. [28] described an activity detection system that uses machine learning techniques (a multiclass SVM trained using sequential minimal optimization (SMO)) to identify actions based on skeletal data taken from a depth camera. Li et al. [29] developed a skeleton-based action identification system that mines important skeleton postures using latent SVM. The research revealed that distinguishing human actions requires only a few frames with crucial skeletal postures.

Arai and Andrie [30] created a 3D skeleton model using the Kinect sensor and the Ipisoft motion capture program. Ipisoft is a purpose-built tool that allows users to design skeletons for their computer-generated characters. The knee angle feature is extracted from the skeleton and used to quantify the quality of disabled gait. Anjum et al. [31] created feature vectors based on the 3D locations of the joints over the course of an activity, which are then used for SVM-based training and testing of activity identification for genuine human-robot interaction.

Piyathilaka and Kodagoda [32] offered the notion of a spatial affordance map, which uses geometric aspects of the environment to learn about human context. Rather than watching real individuals in the environment, the suggested affordance mapping approach models the interaction between the environment and humans using virtual humans. Learning the spatial affordance map is formulated as a multi-label classification problem that can be solved using SVM-based learners. Experiments on an actual 3D scene dataset yielded good results, demonstrating the use of the affordance map for mapping human context.

Elforaici et al. [33] created an automatic posture recognition system using an RGB-D camera (Kinect). They present two supervised algorithms for learning and detecting human poses using the multiple types of visual input of an RGB-D camera. One method takes advantage of the three-dimensional configuration of body joints; posture recognition is then performed by SVM classification of 3D skeleton-based features.

To reduce the potential injury caused by falls, Han et al. [34] proposed a two-stage fall detection system based on human postural features. For preprocessing, they derived two additional key features, deflection angles and spine ratio, to describe changes in human posture based on the human skeleton, and classified them using both SVM and kNN. Ubalde et al. [35] represented skeletal sequences as a bag of time-stamped descriptors and provided a new framework for action classification based on the kNN approach. Ramirez et al. [36] proposed a camera-vision-based fall detection system that uses a kNN classifier on the extracted features.

Seungryul et al. [37] studied the challenge of activity recognition in a 24-hour monitoring scenario of patient actions in a hospital, where the objective was to identify both static and dynamic actions successfully. They suggest using a kinematic-layout-aware random forest to encode scene layout and skeleton information as privileged information, capturing more geometry and kinematic-layout information and improving the discriminative power of action classification. Laraba et al. [38] introduced a novel motion sequence representation that projects movement sequences into the RGB domain. Action classification becomes an image classification problem, since the 3D coordinates of the joints are mapped to red, green, and blue values. Basic image classification methods, such as SVM, kNN, and RF, as well as CNNs, were used to evaluate this representation.
Canavan et al. [39] suggested combining a random regression forest with a unique set of feature descriptors built from bone data received from the Leap Motion controller to recognize hand gestures automatically. Boissiere and Noumeir [40] proposed an end-to-end trainable network for human action identification utilizing skeleton and infrared data, with a 2D CNN as a pose module extracting features from skeleton data and a 3D CNN as an infrared module extracting visual characteristics from clips. Both feature vectors are then merged and explored together using a multilayer perceptron. Zhao et al. [41] described a technique that uses various classifiers to identify people, using static features taken from Kinect skeletal data and classifiers (kNN, decision tree, Gaussian naive Bayes, multilayer perceptron, and SVM) to predict the result.


3. PROPOSED METHOD 
Figure 4 shows the diagram of the proposed approach, which uses the Kinect V2 sensor and the above classifiers in the following steps:
— Build dataset (collect datasets using the Kinect skeleton). 
— Calculate angles. 
— Save data in three separate CSV files containing different types of data. 
— Train classifiers. 
— Store the trained models using the pickle module.
— Real-time recognition using saved models. 


[Figure 4 blocks: Kinect sensor; skeleton model; feature computation (3D joint positions, calculated angles); save data in three CSV files (joint data, angle data, both); train classifiers; store models; real-time detection and classification; class type]

Figure 4. Diagram of the proposed approach


3.1. Build dataset 

The dataset we collected for eight fixed positions came from 127 volunteers (men and women) aged 20 to 41, with different heights (1.45-1.91 m) and body sizes. Each volunteer performs the eight positions: “hands up”, “right hand up”, “left hand up”, “hands on head”, “arms open”, “stand up straight”, “hands on waist”, and “hands forward”, as shown in Figure 5, interspersed with slight movements that remain within the same position. For each person, we record 240 frames; each frame contains the camera coordinates (X, Y, Z) of 15 joints and 6 angles. The total number of recorded frames is 30,480, of which 72% are used for training and 28% for testing.
Figure 5. Eight hand positions
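The dataset arithmetic and the 72%/28% split can be sketched as follows. This is a hedged illustration: the paper does not say how the split was drawn, so the stratification, the fixed seed, and the evenly cycling placeholder labels are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

N_VOLUNTEERS = 127
FRAMES_PER_VOLUNTEER = 240
N_POSITIONS = 8

total_frames = N_VOLUNTEERS * FRAMES_PER_VOLUNTEER  # 127 * 240 = 30,480

# Placeholder labels: one position id (0..7) per frame, cycling evenly.
labels = np.arange(total_frames) % N_POSITIONS
frame_ids = np.arange(total_frames)

# 72% / 28% split; stratification and the seed are assumptions.
train_ids, test_ids, y_train, y_test = train_test_split(
    frame_ids, labels, test_size=0.28, stratify=labels, random_state=0
)
print(total_frames, len(train_ids), len(test_ids))
```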


3.2. Calculate angles

Given the spatial coordinates of the joint points, an angle can be calculated from three 3D points by forming space vectors between them, for example the vectors (ER-SR) and (SR-SS), where ER is the elbow-right joint point, SR is the shoulder-right joint point, and SS is the spine-shoulder joint point, as shown in Figure 6.


Figure 6. Diagram of joint angle


Assume the coordinates of the elbow-right joint point are $(x_1, y_1, z_1)$, the coordinates of the shoulder-right joint point are $(x_2, y_2, z_2)$, and the coordinates of the spine-shoulder joint point are $(x_3, y_3, z_3)$. Then vector $\mathbf{a} = (x_2 - x_1,\ y_2 - y_1,\ z_2 - z_1)$ and vector $\mathbf{b} = (x_3 - x_2,\ y_3 - y_2,\ z_3 - z_2)$. If the included angle of $(\mathbf{a}, \mathbf{b})$ is $\alpha$, then:

$$\cos\alpha = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}|\,|\mathbf{b}|} \tag{1}$$

$$\mathbf{a} \cdot \mathbf{b} = (x_2 - x_1)(x_3 - x_2) + (y_2 - y_1)(y_3 - y_2) + (z_2 - z_1)(z_3 - z_2) \tag{2}$$

$$|\mathbf{a}| = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} \tag{3}$$

$$|\mathbf{b}| = \sqrt{(x_3 - x_2)^2 + (y_3 - y_2)^2 + (z_3 - z_2)^2} \tag{4}$$

To get the angle between the vectors formed by the three joint points joined in pairs, substitute (2)-(4) into (1). This strategy was used by Liu et al. [12].
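The angle computation in (1)-(4) can be sketched in a few lines of Python. The function name, the choice of degrees as the output unit, the clamping of the cosine, and the sample coordinates are assumptions made for illustration.

```python
import math
from typing import Sequence


def joint_angle(p1: Sequence[float], p2: Sequence[float], p3: Sequence[float]) -> float:
    """Angle in degrees between a = p2 - p1 and b = p3 - p2, following (1)-(4)."""
    a = [p2[i] - p1[i] for i in range(3)]
    b = [p3[i] - p2[i] for i in range(3)]
    dot = sum(a[i] * b[i] for i in range(3))        # equation (2)
    norm_a = math.sqrt(sum(c * c for c in a))       # equation (3)
    norm_b = math.sqrt(sum(c * c for c in b))       # equation (4)
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("coincident joints: angle is undefined")
    # equation (1); the ratio is clamped to [-1, 1] to absorb rounding error
    cos_alpha = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_alpha))


# Made-up coordinates: elbow right (ER), shoulder right (SR), spine shoulder (SS).
er, sr, ss = (0.30, 0.20, 2.0), (0.20, 0.45, 2.0), (0.00, 0.45, 2.0)
angle = joint_angle(er, sr, ss)
print(f"{angle:.1f} degrees")
```

With this vector convention, three collinear joints in sequence give an angle of 0 degrees rather than 180 degrees, because both vectors point in the same direction.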


3.3. Save data 

This research focuses on the upper half of the body, specifically the location of the hands, so we use the 15 upper joints and the angles that determine the movement of the hands. The lower half of the body does not affect the positions considered in this research. To reduce processing, we save the data in three separate files. The first file stores the coordinates (X, Y, Z) of the upper joints: head, neck, spine shoulder, spine mid, spine base, shoulder (left, right), elbow (left, right), wrist (left, right), hand (left, right), and hip (left, right). The second file stores the six calculated angles shown in Figure 7, for both the right and left sides: the shoulder angle, calculated from the points (spine shoulder-shoulder-elbow); the elbow angle, calculated from the points (shoulder-elbow-wrist); and the wrist angle, calculated from the points (elbow-wrist-hand). The third file combines the first and second files, meaning that both joints and angles are used to train the algorithm.


Figure 7. Positions of the six calculated angles 
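The three-file layout can be sketched as follows. The file names, column names, and column order are hypothetical; the paper only states that points, angles, and their combination are stored in separate CSV files.

```python
import csv
import os
import tempfile

UPPER_JOINTS = [
    "head", "neck", "spine_shoulder", "spine_mid", "spine_base",
    "shoulder_left", "shoulder_right", "elbow_left", "elbow_right",
    "wrist_left", "wrist_right", "hand_left", "hand_right",
    "hip_left", "hip_right",
]
ANGLES = [
    "shoulder_left", "shoulder_right", "elbow_left", "elbow_right",
    "wrist_left", "wrist_right",
]

point_cols = [f"{j}_{axis}" for j in UPPER_JOINTS for axis in "xyz"]  # 15 * 3 columns
angle_cols = [f"{a}_angle" for a in ANGLES]                           # 6 columns


def save_frames(out_dir, frames):
    """Write points.csv, angles.csv and both.csv; each frame is (points, angles, label)."""
    layouts = {
        "points.csv": (point_cols, lambda p, a: p),
        "angles.csv": (angle_cols, lambda p, a: a),
        "both.csv": (point_cols + angle_cols, lambda p, a: p + a),
    }
    for name, (cols, pick) in layouts.items():
        with open(os.path.join(out_dir, name), "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(cols + ["label"])
            for points, angles, label in frames:
                writer.writerow(pick(points, angles) + [label])


# One dummy frame (all zeros) just to exercise the writer.
out_dir = tempfile.mkdtemp()
save_frames(out_dir, [([0.0] * 45, [0.0] * 6, "hands up")])
```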


3.4. Train the classifiers

Three types of data are used to train the classifiers: the first is the skeleton joint coordinates from the Kinect device; the second is the six angles shown in Figure 7, calculated with the method described above; and the third combines the joints and angles. These datasets are used to train a set of classifiers (SVM, random forest, k-nearest neighbors, and multilayer perceptron). As mentioned above, 72% of the dataset is used for training the classifiers.
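A minimal training sketch with scikit-learn is given here. The data are synthetic stand-ins for one of the three datasets, and the hyperparameters are library defaults or guesses, since the settings used in the experiments are not listed in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the points dataset: 45 features, 8 classes.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(8), 100)
X = rng.normal(size=(y.size, 45)) + y[:, None]  # shift each class so it is learnable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.28, stratify=y, random_state=0
)

# Hyperparameters are defaults or guesses, not the paper's settings.
classifiers = {
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (polynomial)": SVC(kernel="poly", degree=3),
    "SVM (RBF)": SVC(kernel="rbf"),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # accuracy on the held-out 28%
    print(f"{name}: {scores[name]:.3f}")
```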


3.5. Store models 

Training usually takes longer than the other steps. To save time and avoid retraining the classifier at each run of the real-time system, we save each model after it has been trained and load it when needed. For this we use Python's built-in persistence module, pickle, and the saved models are used by the real-time classifiers as shown in Figure 4.
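Saving and reloading a trained model with pickle can be sketched as follows. The tiny training set and the file name are hypothetical and only serve to make the example self-contained.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny synthetic training set standing in for the real data.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
y = np.array(["hands up", "hands up", "arms open", "arms open"])
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Train once, then persist the fitted model to disk.
model_path = os.path.join(tempfile.mkdtemp(), "knn_points.pkl")  # hypothetical name
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)

# Later, e.g. at start-up of the real-time system: load instead of retraining.
with open(model_path, "rb") as fh:
    loaded = pickle.load(fh)

prediction = loaded.predict([[0.95, 1.0]])[0]
print(prediction)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best reloaded with the same library version that produced it.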


3.6. Real-time detection and classification 

After the classifier is trained and saved as a pickle file, the recognition stage begins by running a dedicated program, written in Visual Basic and C++, to choose the type of classifier and the type of data used in the real-time detection system (Figure 8). The saved model is then loaded according to this choice, and the Kinect device starts tracking the person and sends the data to a Python script, which extracts the data from each frame individually and stores it in a list.

The content of the list depends on the type of data to be classified. For the first type, the skeleton joint data are placed in the list. For the second type, the required angles are calculated and placed in the list. For the third type, both are placed in the list. The classifier then makes the prediction and displays it on the screen, as shown in Figure 9.
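The per-frame list construction for the three data types can be sketched as follows. The joint names, the feature ordering (which must match the order used during training), and the zero fallback for degenerate angles are assumptions.

```python
import math

SIDES = ("left", "right")


def angle(p1, p2, p3):
    """Angle in degrees between vectors p2 - p1 and p3 - p2 (0.0 if degenerate)."""
    a = [q - p for p, q in zip(p1, p2)]
    b = [q - p for p, q in zip(p2, p3)]
    dot = sum(u * v for u, v in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    if norm == 0.0:
        return 0.0  # assumption: treat coincident joints as a zero angle
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def frame_features(joints, data_type):
    """Build the feature list for one frame; joints maps joint name -> (x, y, z)."""
    points = [c for name in sorted(joints) for c in joints[name]]
    angles = []
    for side in SIDES:
        chain = ["spine_shoulder", f"shoulder_{side}", f"elbow_{side}",
                 f"wrist_{side}", f"hand_{side}"]
        # shoulder, elbow and wrist angles from consecutive joint triples
        for i in range(3):
            angles.append(angle(*(joints[n] for n in chain[i:i + 3])))
    if data_type == "points":
        return points
    if data_type == "angles":
        return angles
    if data_type == "both":
        return points + angles
    raise ValueError(f"unknown data type: {data_type}")


# Dummy skeleton with the 15 upper-body joints at made-up positions.
names = (["head", "neck", "spine_shoulder", "spine_mid", "spine_base",
          "hip_left", "hip_right"]
         + [f"{part}_{side}" for part in ("shoulder", "elbow", "wrist", "hand")
            for side in SIDES])
joints = {name: (0.1 * i, 0.2 * i, 2.0) for i, name in enumerate(names)}
features = frame_features(joints, "both")
print(len(features))
```

The resulting list can then be passed to a loaded model, for example `model.predict([features])`.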


Figure 8. Online hand position detection and classification system; the main window


Figure 9. An example of real-time recognition


4. EXPERIMENTAL RESULTS 

The classifiers were implemented in our own code using Python version 3.9 and the scikit-learn library version 1.0.1 [42]. The tests were run on a computer with the following specifications: software, Microsoft Windows 10 Pro 64-bit, version 21H2; hardware, Intel Core i7-4510U 2.00 GHz processor, 16 GB memory, and a 1 TB SSD. The experimental results of the classifiers are examined to determine which classifier is best in terms of accuracy and the kind of data used. As Tables 1-2 show, the classifiers achieve their best performance on points data, except for random forests, which have the best accuracy on the third type of data. Most importantly, the accuracy of the classifiers exceeded 93 percent in some cases and reached 99 percent for MLP and SVM with the polynomial kernel.


Table 1. The classifier test result of SVM types on three types of data (P = precision, R = recall, F1 = F1-score)

| Data type | Position name | Linear P | Linear R | Linear F1 | Polynomial P | Polynomial R | Polynomial F1 | RBF P | RBF R | RBF F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Angles | hands up | 0.31 | 0.28 | 0.3 | 0.41 | 0.42 | 0.41 | 0.52 | 0.4 | 0.45 |
| Angles | right hand up | 0.62 | 0.61 | 0.61 | 0.55 | 0.53 | 0.54 | 0.64 | 0.57 | 0.61 |
| Angles | left hand up | 0.48 | 0.39 | 0.43 | 0.45 | 0.4 | 0.42 | 0.52 | 0.39 | 0.44 |
| Angles | hands on head | 0.86 | 0.93 | 0.89 | 0.9 | 0.89 | 0.9 | 0.89 | 0.93 | 0.91 |
| Angles | arms open | 1 | 0.93 | 0.96 | 0.97 | 0.97 | 0.97 | 0.99 | 0.94 | 0.96 |
| Angles | stand up straight | 0.44 | 0.79 | 0.57 | 0.53 | 0.74 | 0.62 | 0.5 | 0.76 | 0.6 |
| Angles | hands on waist | 0.8 | 0.85 | 0.83 | 0.91 | 0.87 | 0.89 | 0.91 | 0.86 | 0.88 |
| Angles | hands forward | 0.54 | 0.24 | 0.34 | 0.57 | 0.45 | 0.5 | 0.57 | 0.63 | 0.6 |
| Angles | Accuracy | 62.72% | | | 65.76% | | | 68.54% | | |
| Points | hands up | 0.98 | 0.94 | 0.96 | 0.98 | 0.99 | 0.99 | 0.96 | 0.98 | 0.97 |
| Points | right hand up | 0.93 | 1 | 0.96 | 0.99 | 1 | 1 | 1 | 1 | 1 |
| Points | left hand up | 0.99 | 0.99 | 0.99 | 1 | 1 | 1 | 1 | 0.99 | 1 |
| Points | hands on head | 1 | 0.95 | 0.98 | 0.99 | 0.99 | 0.99 | 0.98 | 0.97 | 0.97 |
| Points | arms open | 0.98 | 0.99 | 0.98 | 1 | 0.99 | 1 | 0.97 | 0.99 | 0.98 |
| Points | stand up straight | 0.94 | 1 | 0.97 | 0.97 | 0.99 | 0.98 | 0.97 | 0.99 | 0.98 |
| Points | hands on waist | 1 | 0.97 | 0.98 | 0.99 | 0.97 | 0.98 | 0.99 | 0.96 | 0.98 |
| Points | hands forward | 1 | 0.96 | 0.98 | 1 | 0.99 | 0.99 | 1 | 0.97 | 0.98 |
| Points | Accuracy | 97.48% | | | 99.07% | | | 98.26% | | |
| Both | hands up | 0.96 | 0.99 | 0.97 | 0.43 | 0.34 | 0.37 | 0.37 | 0.27 | 0.31 |
| Both | right hand up | 0.99 | 0.99 | 0.99 | 0.67 | 0.62 | 0.64 | 0.64 | 0.57 | 0.6 |
| Both | left hand up | 0.99 | 0.98 | 0.99 | 0.57 | 0.46 | 0.51 | 0.56 | 0.43 | 0.48 |
| Both | hands on head | 0.99 | 0.96 | 0.98 | 0.84 | 0.95 | 0.89 | 0.86 | 0.95 | 0.9 |
| Both | arms open | 0.95 | 0.99 | 0.97 | 1 | 0.93 | 0.96 | 1 | 0.93 | 0.96 |
| Both | stand up straight | 0.85 | 1 | 0.92 | 0.46 | 0.83 | 0.59 | 0.42 | 0.85 | 0.57 |
| Both | hands on waist | 1 | 0.86 | 0.92 | 0.79 | 0.89 | 0.84 | 0.8 | 0.89 | 0.84 |
| Both | hands forward | 1 | 0.92 | 0.96 | 0.6 | 0.29 | 0.39 | 0.57 | 0.26 | 0.36 |
| Both | Accuracy | 96.16% | | | 66.41% | | | 64.34% | | |

Table 2. The classifier test result of k-NN, RF, and MLP on three types of data (P = precision, R = recall, F1 = F1-score)

| Data type | Position name | kNN P | kNN R | kNN F1 | RF P | RF R | RF F1 | MLP P | MLP R | MLP F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Angles | hands up | 0.39 | 0.37 | 0.38 | 0.45 | 0.35 | 0.4 | 0.28 | 0.32 | 0.3 |
| Angles | right hand up | 0.55 | 0.58 | 0.57 | 0.59 | 0.63 | 0.61 | 0.63 | 0.53 | 0.58 |
| Angles | left hand up | 0.5 | 0.45 | 0.47 | 0.52 | 0.47 | 0.5 | 0.45 | 0.39 | 0.42 |
| Angles | hands on head | 0.89 | 0.91 | 0.9 | 0.92 | 0.93 | 0.92 | 0.75 | 0.9 | 0.82 |
| Angles | arms open | 0.98 | 0.93 | 0.95 | 1 | 0.94 | 0.97 | 0.9 | 0.91 | 0.91 |
| Angles | stand up straight | 0.49 | 0.72 | 0.58 | 0.61 | 0.65 | 0.63 | 0.51 | 0.72 | 0.6 |
| Angles | hands on waist | 0.86 | 0.85 | 0.85 | 0.9 | 0.85 | 0.87 | 0.81 | 0.66 | 0.73 |
| Angles | hands forward | 0.63 | 0.42 | 0.5 | 0.56 | 0.72 | 0.63 | 0.54 | 0.4 | 0.46 |
| Angles | Accuracy | 65.33% | | | 69.27% | | | 60.35% | | |
| Points | hands up | 0.81 | 0.72 | 0.76 | 0.85 | 0.88 | 0.86 | 0.98 | 1 | 0.99 |
| Points | right hand up | 0.99 | 0.96 | 0.97 | 0.99 | 1 | 0.99 | 1 | 1 | 1 |
| Points | left hand up | 1 | 0.96 | 0.98 | 0.98 | 1 | 0.99 | 1 | 0.99 | 1 |
| Points | hands on head | 0.69 | 0.81 | 0.75 | 0.95 | 0.78 | 0.86 | 0.99 | 0.99 | 0.99 |
| Points | arms open | 1 | 0.94 | 0.97 | 0.75 | 0.99 | 0.85 | 1 | 0.99 | 1 |
| Points | stand up straight | 0.74 | 0.52 | 0.61 | 0.9 | 0.85 | 0.87 | 0.95 | 1 | 0.97 |
| Points | hands on waist | 0.58 | 0.83 | 0.68 | 0.83 | 0.89 | 0.86 | 0.99 | 0.94 | 0.97 |
| Points | hands forward | 0.96 | 0.87 | 0.91 | 0.96 | 0.75 | 0.84 | 0.99 | 0.99 | 0.99 |
| Points | Accuracy | 82.67% | | | 89.19% | | | 98.79% | | |
| Both | hands up | 0.55 | 0.5 | 0.52 | 0.92 | 0.87 | 0.9 | 0.93 | 0.82 | 0.87 |
| Both | right hand up | 0.67 | 0.7 | 0.68 | 0.99 | 1 | 0.99 | 0.99 | 0.98 | 0.98 |
| Both | left hand up | 0.63 | 0.57 | 0.6 | 0.98 | 1 | 0.99 | 0.94 | 0.98 | 0.96 |
| Both | hands on head | 0.89 | 0.91 | 0.9 | 0.89 | 0.94 | 0.91 | 0.83 | 0.93 | 0.88 |
| Both | arms open | 0.96 | 0.93 | 0.94 | 0.89 | 0.99 | 0.94 | 1 | 0.93 | 0.96 |
| Both | stand up straight | 0.6 | 0.8 | 0.69 | 0.93 | 0.99 | 0.96 | 0.96 | 0.99 | 0.97 |
| Both | hands on waist | 0.86 | 0.87 | 0.87 | 0.98 | 0.95 | 0.97 | 0.9 | 0.96 | 0.93 |
| Both | hands forward | 0.72 | 0.56 | 0.63 | 0.99 | 0.82 | 0.9 | 0.96 | 0.91 | 0.93 |
| Both | Accuracy | 73.07% | | | 94.29% | | | 93.63% | | |

We also noticed that the classifiers do not work correctly for some positions when using angles data; in particular, errors occur for the hands forward position, as shown in the confusion matrices in Tables 3-5. With angles data, the accuracy is also lower than with the other data types. The reason may be that the ranges of these angles are not large enough; furthermore, the values are similar across most of the positions or contain more noise.
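The link between a confusion matrix and the per-class scores can be checked with a short calculation. The sketch uses the kNN matrix for angles data as reported in Table 4 (rows are true labels, columns are predicted labels); the variable names are illustrative.

```python
import numpy as np

LABELS = ["hands up", "right hand up", "left hand up", "hands on head",
          "arms open", "stand up straight", "hands on waist", "hands forward"]

# Rows: true label, columns: predicted label (kNN, angles data, Table 4).
cm = np.array([
    [389, 161,  62,  66,   1, 219,  77,  75],
    [ 95, 610, 150,   0,   0, 150,   5,  40],
    [110, 134, 476,   2,  17, 251,   9,  51],
    [ 23,   0,   3, 956,   0,   0,  25,  43],
    [ 23,  32,  15,   0, 975,   0,   3,   2],
    [103,  60, 109,   0,   0, 753,   6,  19],
    [ 90,   9,   0,  21,   3,  12, 890,  25],
    [162, 100, 141,  35,   0, 148,  25, 439],
])

recall = np.diag(cm) / cm.sum(axis=1)      # correct / all frames of the true class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / all frames predicted as the class
accuracy = np.trace(cm) / cm.sum()

for name, p, r in zip(LABELS, precision, recall):
    print(f"{name:18s} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Each row sums to 1,050 test frames, and the computed accuracy agrees with the 65.33% listed for kNN on angles data in Table 2. The low recall for hands forward reflects the many frames of that position that are predicted as other classes.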


Table 3. Confusion matrix of SVM with Linear kernel and SVM with Poly kernel classifiers on three types of data
Dat. True 
oe position SVM with Linear kernel SVM with Polynomial kernel 
type label 
handsup 298 144 107 91 0 281 102 27 438 133 92 55 1 232 31 68 
right 59 636 103 30 0 150 13 59 117 555 139 0 0 138 0 101 
hand up 
‘a hand 177 98 409 6 0 2909 7 63 159 168 419 5 0 25 7 6 
Angl ae n o 0 6 977 0 7 21 39 4 5 20 935 1 0 18 28 
es 
Datas “8 27 2 1 0 977 0 4 0 14 1 3 0 1015 0 15 2 
et open 
stand up g9 18 8 0 0 825 0 33 98 32 92 2 0 776 2 48 
straight 
handson ie Bg 2 g o0 14 890 1 62 16 0 6 0 10 917 39 
walst 
hands 199 105 137 30 1 291 30 257 135 108 167 32 29 90 20 469 
forward 
handsup 983 31 12 0 24 0 0 0 i: 0 3 11 0 oO oO 0 
right 0 1050 0 0 0 0 0 0 0 1050 0 0 o oO 0 0 
hand up 
oe hand. si 0 1%0 0 0 10 0 0 0 0 1050 o o0 o o0 0 
Points Padson g 40 0 1002 0 0 0 0 7 0 0 1043 0 0 0 0 
head 
Datas- aris 
et 8 o 0 0 102 0 0 0 9 0 0 0 1041 o o0 0 
open 
standup 0 0 0 0 0 1047 3 0 0 0 0 0 0 1044 6 0 
straight 
handson 26 0 0 0 0 33 1017 0 0 0 0 0 0 31 1019 0 
waist 
hands 0 13 0 29 0 108 0 9 o0 0 2 0 1039 
forward 
handsup 1036 0 14 0 0 0 0 0 352 120 102 103 0 262 85 2⁄4 
right 0 1036 0 0 14 0 0 67 646 7 30 166 19 45 
hand up 
aa hand o 0 1034 2 0 14 0 0 42 75 484 11 0 319 42 77 


hands on 
Both head 38 0 0 1012 0 0 0 0 1 0 0 1001 0 4 17 27 


Dates arms 6 2 1 0 14 0. 0 0 5% 0 0 0 96 0 24 0 
et open 
standup o 0 0 oO 0 1048 2 0 51 16 7 0 0 87 3 2322 
straight 
handson 6 © © o0 0 14 9% 2 %2 ı ı 9 0 7 938 2 
waist 
hands 1 0 8 50 14 0 967 173 111 109 35 1 263 51 307 
forward 
Fredie left han- stand left han ar a ban han 
ted right ar hands hands han- right ds ds 
S han- ha ds up : ha ms up ds 
positi- hand ms | on wai- forw- ds hand on on 
ds up nd on straig- nd op stra .  forw 
on up up head P ht st ard up up oe he Ped = wai od 
label p ad st 
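Table 4. Confusion matrix of SVM with RBF kernel and kNN classifiers on three types of data

Rows are true position labels; columns are predicted position labels: HU = hands up, RHU = right hand up, LHU = left hand up, HH = hands on head, AO = arms open, SUS = stand up straight, HW = hands on waist, HF = hands forward.

(a) SVM with RBF kernel

| Data type | True label | HU | RHU | LHU | HH | AO | SUS | HW | HF |
|---|---|---|---|---|---|---|---|---|---|
| Angles | hands up | 421 | 123 | 51 | 75 | 0 | 247 | 36 | 97 |
| Angles | right hand up | 78 | 603 | 85 | 0 | 0 | 149 | 15 | 120 |
| Angles | left hand up | 63 | 120 | 408 | 8 | 14 | 344 | 3 | 90 |
| Angles | hands on head | 0 | 0 | 8 | 978 | 0 | 0 | 9 | 55 |
| Angles | arms open | 11 | 0 | 16 | 0 | 982 | 0 | 9 | 32 |
| Angles | stand up straight | 78 | 26 | 88 | 1 | 0 | 798 | 0 | 59 |
| Angles | hands on waist | 63 | 12 | 6 | 5 | 0 | 2 | 906 | 56 |
| Angles | hands forward | 94 | 56 | 124 | 30 | 0 | 62 | 22 | 662 |
| Points | hands up | 1032 | 0 | 0 | 18 | 0 | 0 | 0 | 0 |
| Points | right hand up | 0 | 1050 | 0 | 0 | 0 | 0 | 0 | 0 |
| Points | left hand up | 0 | 0 | 1044 | 6 | 0 | 0 | 0 | 0 |
| Points | hands on head | 36 | 0 | 0 | 1014 | 0 | 0 | 0 | 0 |
| Points | arms open | 9 | 0 | 0 | 0 | 1041 | 0 | 0 | 0 |
| Points | stand up straight | 0 | 0 | 0 | 0 | 0 | 1044 | 6 | 0 |
| Points | hands on waist | 0 | 0 | 0 | 0 | 0 | 37 | 1011 | 2 |
| Points | hands forward | 0 | 3 | 0 | 0 | 29 | 0 | 0 | 1018 |
| Both | hands up | 280 | 132 | 82 | 84 | 0 | 330 | 113 | 29 |
| Both | right hand up | 63 | 594 | 84 | 27 | 0 | 216 | 15 | 51 |
| Both | left hand up | 68 | 63 | 448 | 13 | 0 | 373 | 16 | 69 |
| Both | hands on head | 0 | 0 | 0 | 997 | 0 | 4 | 21 | 28 |
| Both | arms open | 38 | 4 | 0 | 0 | 976 | 0 | 22 | 10 |
| Both | stand up straight | 52 | 18 | 60 | 0 | 0 | 897 | 4 | 19 |
| Both | hands on waist | 89 | 2 | 1 | 8 | 0 | 7 | 937 | 6 |
| Both | hands forward | 166 | 114 | 132 | 29 | 1 | 292 | 40 | 276 |

(b) kNN

| Data type | True label | HU | RHU | LHU | HH | AO | SUS | HW | HF |
|---|---|---|---|---|---|---|---|---|---|
| Angles | hands up | 389 | 161 | 62 | 66 | 1 | 219 | 77 | 75 |
| Angles | right hand up | 95 | 610 | 150 | 0 | 0 | 150 | 5 | 40 |
| Angles | left hand up | 110 | 134 | 476 | 2 | 17 | 251 | 9 | 51 |
| Angles | hands on head | 23 | 0 | 3 | 956 | 0 | 0 | 25 | 43 |
| Angles | arms open | 23 | 32 | 15 | 0 | 975 | 0 | 3 | 2 |
| Angles | stand up straight | 103 | 60 | 109 | 0 | 0 | 753 | 6 | 19 |
| Angles | hands on waist | 90 | 9 | 0 | 21 | 3 | 12 | 890 | 25 |
| Angles | hands forward | 162 | 100 | 141 | 35 | 0 | 148 | 25 | 439 |
| Points | hands up | 755 | 0 | 0 | 278 | 0 | 0 | 0 | 17 |
| Points | right hand up | 12 | 1008 | 0 | 0 | 0 | 0 | 30 | 0 |
| Points | left hand up | 0 | 0 | 1011 | 9 | 0 | 15 | 15 | 0 |
| Points | hands on head | 144 | 0 | 0 | 853 | 0 | 30 | 0 | 23 |
| Points | arms open | 25 | 0 | 0 | 8 | 985 | 2 | 30 | 0 |
| Points | stand up straight | 0 | 0 | 0 | 0 | 0 | 551 | 499 | 0 |
| Points | hands on waist | 0 | 0 | 0 | 30 | 0 | 147 | 873 | 0 |
| Points | hands forward | 0 | 11 | 0 | 59 | 0 | 0 | 71 | 909 |
| Both | hands up | 525 | 172 | 63 | 71 | 5 | 97 | 66 | 51 |
| Both | right hand up | 45 | 735 | 82 | 0 | 0 | 135 | 14 | 39 |
| Both | left hand up | 104 | 38 | 596 | 3 | 17 | 237 | 8 | 47 |
| Both | hands on head | 28 | 1 | 4 | 955 | 0 | 0 | 24 | 38 |
| Both | arms open | 19 | 38 | 8 | 0 | 975 | 0 | 8 | 2 |
| Both | stand up straight | 58 | 41 | 76 | 0 | 0 | 845 | 8 | 22 |
| Both | hands on waist | 38 | 8 | 1 | 16 | 20 | 17 | 917 | 33 |
| Both | hands forward | 135 | 71 | 112 | 31 | 0 | 86 | 25 | 590 |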


It is worth noting that any saved classifier model works in real-time testing without problems or delays in the display. Table 6 shows the average time taken to test each frame and show the result. The fastest classifier is MLP, followed by SVM with the linear kernel, and the slowest is random forest, but all of them fall within real-time limits for the test.
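Table 5. Confusion matrix of RF and MLP classifiers on three types of data

Rows are true position labels; columns are predicted position labels: HU = hands up, RHU = right hand up, LHU = left hand up, HH = hands on head, AO = arms open, SUS = stand up straight, HW = hands on waist, HF = hands forward.

(a) RF

| Data type | True label | HU | RHU | LHU | HH | AO | SUS | HW | HF |
|---|---|---|---|---|---|---|---|---|---|
| Angles | hands up | 370 | 169 | 93 | 45 | 0 | 163 | 60 | 150 |
| Angles | right hand up | 82 | 657 | 165 | 0 | 0 | 57 | 0 | 89 |
| Angles | left hand up | 89 | 129 | 496 | 9 | 0 | 157 | 0 | 170 |
| Angles | hands on head | 12 | 0 | 4 | 977 | 0 | 2 | 13 | 42 |
| Angles | arms open | 10 | 0 | 31 | 0 | 987 | 0 | 14 | 8 |
| Angles | stand up straight | 109 | 78 | 99 | 2 | 0 | 682 | 1 | 79 |
| Angles | hands on waist | 53 | 17 | 1 | 7 | 0 | 20 | 894 | 58 |
| Angles | hands forward | 97 | 59 | 65 | 25 | 2 | 34 | 12 | 756 |
| Points | hands up | 920 | 0 | 15 | 39 | 60 | 0 | 0 | 16 |
| Points | right hand up | 0 | 1050 | 0 | 0 | 0 | 0 | 0 | 0 |
| Points | left hand up | 0 | 0 | 1050 | 0 | 0 | 0 | 0 | 0 |
| Points | hands on head | 156 | 0 | 0 | 820 | 61 | 0 | 0 | 13 |
| Points | arms open | 8 | 0 | 0 | 0 | 1042 | 0 | 0 | 0 |
| Points | stand up straight | 0 | 0 | 0 | 0 | 0 | 888 | 162 | 0 |
| Points | hands on waist | 0 | 0 | 0 | 0 | 9 | 104 | 937 | 0 |
| Points | hands forward | 0 | 11 | 3 | 0 | 217 | 0 | 32 | 785 |
| Both | hands up | 918 | 0 | 15 | 117 | 0 | 0 | 0 | 0 |
| Both | right hand up | 0 | 1050 | 0 | 0 | 0 | 0 | 0 | 0 |
| Both | left hand up | 0 | 0 | 1050 | 0 | 0 | 0 | 0 | 0 |
| Both | hands on head | 68 | 0 | 0 | 982 | 0 | 0 | 0 | 0 |
| Both | arms open | 9 | 0 | 0 | 0 | 1036 | 0 | 0 | 5 |
| Both | stand up straight | 0 | 0 | 0 | 0 | 0 | 1044 | 6 | 0 |
| Both | hands on waist | 0 | 0 | 0 | 0 | 0 | 52 | 998 | 0 |
| Both | hands forward | 0 | 13 | 4 | 0 | 127 | 28 | 13 | 865 |

(b) MLP

| Data type | True label | HU | RHU | LHU | HH | AO | SUS | HW | HF |
|---|---|---|---|---|---|---|---|---|---|
| Angles | hands up | 180 | 214 | 100 | 108 | 36 | 222 | 90 | 100 |
| Angles | right hand up | 57 | 687 | 115 | 0 | 14 | 103 | 0 | 74 |
| Angles | left hand up | 44 | 137 | 438 | 10 | 33 | 285 | 1 | 102 |
| Angles | hands on head | 15 | 6 | 12 | 895 | 0 | 1 | 77 | 44 |
| Angles | arms open | 1 | 3 | 41 | 0 | 974 | 4 | 19 | 8 |
| Angles | stand up straight | 18 | 49 | 81 | 0 | 0 | 853 | 1 | 48 |
| Angles | hands on waist | 79 | 31 | 2 | 19 | 46 | 0 | 822 | 51 |
| Angles | hands forward | 84 | 124 | 153 | 32 | 2 | 88 | 47 | 520 |
| Points | hands up | 1047 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| Points | right hand up | 0 | 1050 | 0 | 0 | 0 | 0 | 0 | 0 |
| Points | left hand up | 0 | 0 | 1045 | 5 | 0 | 0 | 0 | 0 |
| Points | hands on head | 10 | 0 | 0 | 1040 | 0 | 0 | 0 | 0 |
| Points | arms open | 10 | 0 | 0 | 0 | 1040 | 0 | 0 | 0 |
| Points | stand up straight | 0 | 0 | 0 | 0 | 0 | 1047 | 3 | 0 |
| Points | hands on waist | 0 | 0 | 0 | 0 | 0 | 52 | 990 | 8 |
| Points | hands forward | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1048 |
| Both | hands up | 973 | 1 | 18 | 57 | 1 | 0 | 0 | 0 |
| Both | right hand up | 0 | 1050 | 0 | 0 | 0 | 0 | 0 | 0 |
| Both | left hand up | 0 | 0 | 1011 | 0 | 0 | 39 | 0 | 0 |
| Both | hands on head | 62 | 14 | 3 | 970 | 0 | 0 | 0 | 1 |
| Both | arms open | 1 | 0 | 0 | 0 | 1039 | 0 | 0 | 0 |
| Both | stand up straight | 0 | 0 | 0 | 0 | 0 | 1038 | 12 | 0 |
| Both | hands on waist | 0 | 0 | 0 | 3 | 0 | 151 | 890 | 6 |
| Both | hands forward | 0 | 29 | 0 | 27 | 0 | 0 | 1 | 993 |
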
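Table 6. Average time to test one frame in online test

| Classifier | Data type | Average time (second) |
|---|---|---|
| SVM with Linear kernel | Angles | 0.00232 |
| SVM with Linear kernel | Points | 0.00088 |
| SVM with Linear kernel | Both | 0.00082 |
| SVM with Polynomial kernel | Angles | 0.00164 |
| SVM with Polynomial kernel | Points | 0.00124 |
| SVM with Polynomial kernel | Both | 0.00316 |
| SVM with RBF kernel | Angles | 0.00424 |
| SVM with RBF kernel | Points | 0.00325 |
| SVM with RBF kernel | Both | 0.00677 |
| kNN | Angles | 0.00255 |
| kNN | Points | 0.01190 |
| kNN | Both | 0.01518 |
| RF | Angles | 0.02546 |
| RF | Points | 0.02549 |
| RF | Both | 0.03023 |
| MLP | Angles | 0.00078 |
| MLP | Points | 0.00072 |
| MLP | Both | 0.00059 |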
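An average per-frame prediction time of this kind can be measured with a simple timing loop. The sketch below uses synthetic data and an arbitrarily configured MLP, so the measured value depends on the machine and will not match Table 6; the function name is hypothetical.

```python
import time

import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 51 features (points plus angles), 8 classes.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(8), 50)
X = rng.normal(size=(y.size, 51)) + y[:, None]
model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=300, random_state=0).fit(X, y)


def average_frame_time(model, frames):
    """Mean wall-clock seconds per single-frame predict call."""
    start = time.perf_counter()
    for frame in frames:
        model.predict(frame.reshape(1, -1))  # one frame at a time, as in an online test
    return (time.perf_counter() - start) / len(frames)


avg_seconds = average_frame_time(model, X[:200])
print(f"average time per frame: {avg_seconds:.5f} s")
```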


5. CONCLUSION 


In this research, we tested three types of data extracted from the Kinect skeleton on four classifiers and presented the results. The classifiers achieved their best performance on points data, except for random forests, which had the best accuracy on the third type of data. High accuracy was achieved with points data: up to 99.07% for SVM with the polynomial kernel and 98.79% for MLP. Trained classifiers can be saved as models, and the saved models can be used for real-time detection and classification. In the test procedure, the results demonstrated that a human position can be recognized from a single frame of data by examining the incoming data sequentially, frame by frame. Some difficulties occurred, including the inability to train some classifiers, such as SVM with the polynomial kernel, which failed to classify data above the 4th degree, and the training time of SVM, which is longer than that of the other classifiers. In future work, we will study other algorithms and the possibility of linking them with devices to execute commands, or using a Raspberry Pi instead of a PC.